Add support for using payloads to boost terms #3772

bdurand · 2013-09-24T15:42:27Z

It would be great to be able to have a mapping field that stores payloads with terms, and to use those payloads to boost the score of the document.

In my particular use case, I have documents which are tagged by users and after running through filters and algorithms we can determine which tags are most likely useful and which are likely spam. We'd like to pass that information on to the search index so that we can boost the documents we think are most appropriate to the search terms. In this case the boost is known at indexing time and applies to the terms themselves and not to the field or the documents.

This has been possible with Lucene for quite a while, and Solr had partial support for it but never fully implemented it out of the box. (See for example http://wiki.apache.org/solr/Payloads, http://searchhub.org/2009/08/05/getting-started-with-payloads/, http://hnagtech.wordpress.com/2013/04/19/using-payloads-with-solr-4-x/).

Ideally, it would be best to pass the payload in as a separate JSON field value in the document. The Solr tokenizer for payloads (DelimitedPayloadTokenFilterFactory) uses a delimiter, but I've found this to be problematic when dealing with user-generated terms. In addition, it would be best to have the payload value somehow available in scripting, so the payloads can be indexed once and the scoring algorithms then tweaked as necessary to get the right scores.
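To illustrate why a delimiter can clash with user-generated terms, here is a minimal Python sketch. `encode_tag` and `split_token` are hypothetical helpers, and `split_token` is only a simplified stand-in that splits at the first delimiter; it is not the Lucene or Solr filter itself. The `+` delimiter is an assumption chosen for the illustration.

```python
from typing import Tuple


def encode_tag(tag: str, weight: float, delimiter: str = "+") -> str:
    """Join a tag and its weight into a single delimited token."""
    return f"{tag}{delimiter}{weight}"


def split_token(token: str, delimiter: str = "+") -> Tuple[str, str]:
    """Simplified stand-in for a delimited payload filter:
    everything before the first delimiter is the term, the rest is the payload."""
    term, _, payload = token.partition(delimiter)
    return term, payload


# A tag without the delimiter round-trips cleanly.
clean = split_token(encode_tag("search", 0.9))

# A user-supplied tag that contains the delimiter is cut in the wrong place,
# and the leftover payload text is no longer a parseable number.
collide = split_token(encode_tag("c++", 0.9))

print(clean)
print(collide)
```

Under these assumptions, the tag `c++` loses its suffix and its payload becomes unparseable, which is the kind of problem a separate JSON field for the payload would avoid.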

brwe · 2013-09-30T12:48:52Z

I think there are (at least) three issues here:
Taking the payloads into account for scoring could indeed be useful. I will try to come up with something. However, I would like to know how you believe the payload should affect the score. Since the same token can have different payloads, would you have an average of these numbers, the max, the min,...?

As for how to get the payloads in, I believe this is a different issue. It would be easy to expose the DelimitedPayloadTokenFilter in elasticsearch but passing the payloads in while indexing the document might be more tricky. If you desperately need that, could you open a new issue for that?

I do not fully understand how scripting support for payloads should work. Can you elaborate on this a bit or come up with an example?
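On the aggregation question raised above (the same token occurring several times in a field with different payloads), a small Python sketch shows how the candidate choices diverge. The payload values are made up for illustration, and nothing here reflects how Elasticsearch or Lucene actually combines payloads.

```python
from statistics import mean

# Hypothetical payloads attached to two occurrences of one token in a field.
payloads = [2.0, 10.0]

# The three aggregations mentioned in the question, side by side.
candidates = {
    "avg": mean(payloads),
    "max": max(payloads),
    "min": min(payloads),
}

for name, value in candidates.items():
    print(name, value)
```

Which aggregation is appropriate depends on what the payload means: a confidence value might favour the maximum, while a per-occurrence weight might favour the average.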

bdurand · 2013-10-09T17:22:04Z

In my mind the scripting and scoring are tied together simply because I believe this is the kind of issue where you'd need to play with the data after indexing it to get the right scoring. Since the payload would need to be added at indexing time, it would be much easier if the "payload score" could be exposed to a scoring script used for ordering.

My particular use case in detail would be:

While indexing documents, count the number of times each distinct tag has been applied by users to the document. This value would be included with each tag indexed for the document to indicate the weight for that particular tag.
When searching, we would apply the previously defined weights for the tags (terms) in a custom scoring script.

We haven't worked out the actual algorithms yet and it is definitely something we'd need to play around with to get the right values. I would imagine the payload, though, would likely be a number between 0.0 and 1.0 indicating the confidence that the term was an accurate one.
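As a rough sketch of the indexing-time step described above, the following Python counts how often each tag was applied to a document and scales the counts into the 0.0 to 1.0 range. The function name `tag_weights` and the choice to divide by the most frequent tag's count are assumptions made for illustration; the actual algorithm has not been decided.

```python
from collections import Counter
from typing import Dict, Iterable


def tag_weights(applied_tags: Iterable[str]) -> Dict[str, float]:
    """Count each distinct tag and scale counts so the most frequent tag gets 1.0."""
    counts = Counter(applied_tags)
    if not counts:
        return {}
    top = max(counts.values())
    return {tag: count / top for tag, count in counts.items()}


# Hypothetical tags applied by different users to one document.
weights = tag_weights(["python", "python", "search", "python", "spam", "python"])
print(weights)
```

Each resulting weight would then be stored as the payload of the corresponding tag term.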

brwe · 2013-10-24T14:30:27Z

Sorry for the late reply:
I agree that having the payloads available in a script would indeed be helpful to evaluate different scoring functions. But be warned: It will be very slow and only be good for prototyping.

I think the easiest way to do this is to simply make all term information for a document available in scripts. It would be similar to the term vector api. You would then have the freedom to choose any kind of document features for scoring.
What do you think?

brwe · 2013-11-13T13:55:35Z

I made a pull request (#4161) that allows payloads, among other term information, to be accessed in a script. If you are still interested, take a look and see if this is useful for you!

bdurand · 2013-11-13T17:08:08Z

Looks great! Thank you.

term statistics can be accessed via the _shard variable. Below is a minimal example. See documentation on details.

```
DELETE paytest

PUT paytest
{
  "mappings": {
    "test": {
      "_all": {
        "auto_boost": true,
        "enabled": true
      },
      "properties": {
        "text": {
          "index_analyzer": "fulltext_analyzer",
          "store": "yes",
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "filter": [
            "my_delimited_payload_filter"
          ],
          "tokenizer": "whitespace",
          "type": "custom"
        }
      },
      "filter": {
        "my_delimited_payload_filter": {
          "delimiter": "+",
          "encoding": "float",
          "type": "delimited_payload_filter"
        }
      }
    },
    "index": {
      "number_of_replicas": 0,
      "number_of_shards": 1
    }
  }
}

POST paytest/test/1
{
  "text": "the+1 quick+2 brown+3 fox+4 is quick+10"
}

POST paytest/test/2
{
  "text": "the+1 quick+2 red+3 fox+4"
}

POST paytest/_refresh

POST paytest/_search
{
  "script_fields": {
    "ttf": {
      "script": "_shard[\"text\"][\"quick\"].ttf()"
    }
  }
}

POST paytest/_search
{
  "script_fields": {
    "freq": {
      "script": "_shard[\"text\"][\"quick\"].freq()"
    }
  }
}

POST paytest/test/2/_termvector

POST paytest/_search
{
  "script_fields": {
    "payloads": {
      "script": "term = _shard[\"text\"].get(\"red\",_PAYLOADS);payloads = []; for(pos : term){payloads.add(pos.payloadAsFloat(-1));} return payloads;"
    }
  }
}

POST paytest/_search
{
  "script_fields": {
    "tv": {
      "script": "_shard[\"text\"][\"quick\"].freq()"
    }
  },
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
            "script": "_shard[\"text\"][\"quick\"].freq()"
          }
        }
      ]
    }
  }
}
```

closes elastic#3772
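The example above reads payloads in a script field and scores by term frequency. A possible way to combine the two, closer to the original request, is a `script_score` that sums the payloads of a term. The Python sketch below only builds such a request body. It reuses the `_shard`, `_PAYLOADS` and `payloadAsFloat` calls exactly as they appear in the commit message. The summing script itself is an untested assumption, and the helper name `payload_sum_query` is hypothetical.

```python
import json


def payload_sum_query(field: str, term: str) -> dict:
    """Build a function_score body whose script sums a term's payloads.

    The script text is an assumption modelled on the payload script in the
    commit message; it has not been run against a cluster.
    """
    script = (
        f'term = _shard["{field}"].get("{term}",_PAYLOADS);'
        " total = 0.0;"
        " for(pos : term){total = total + pos.payloadAsFloat(0);}"
        " return total;"
    )
    return {
        "query": {
            "function_score": {
                "functions": [{"script_score": {"script": script}}]
            }
        }
    }


body = payload_sum_query("text", "quick")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of `POST paytest/_search`. As noted earlier in the thread, script-based access to term information is expected to be slow and is mainly suited to prototyping.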


ghost assigned brwe Sep 30, 2013

brwe mentioned this issue Oct 9, 2013

Enable delimited payload token filter #3859

Closed

brwe mentioned this issue Nov 13, 2013

make term statistics and term vectors accessible in scripts #4161

Closed

brwe closed this as completed in 1ede9a5 Jan 2, 2014
